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Abstract 

In this paper, we study two general classes of optimization algorithms 
for kernel methods with convex loss function and quadratic norm regu- 
larization, and analyze their convergence. The first approach, based on 
fixed-point iterations, is simple to implement and analyze, and can be 
easily parallelized. The second, based on coordinate descent, exploits the 
structure of additively separable loss functions to compute solutions of line 
searches in closed form. Instances of these general classes of algorithms 
are already incorporated into state of the art machine learning software 
for large scale problems. We start from a solution characterization of the 
regularized problem, obtained using sub-differential calculus and resol- 
vents of monotone operators, that holds for general convex loss functions 
regardless of differentiability. The two methodologies described in the pa- 
per can be regarded as instances of non-linear Jacobi and Gauss-Seidel 
algorithms, and are both well-suited to solve large scale problems. 

1 Introduction 

The development of optimization software for learning from large datasets is 
heavily influenced by memory hierarchies of computer storage. In presence of 
memory constraints, most of the high order optimization methods become un- 
feasible, whereas techniques such as coordinate descent or stochastic gradient 
descent may exploit the specific structure of learning functionals to scale well 
with the dataset size. Considerable eff ort has been devote d to make kernel 
methods feasible on large scale problems [Bottou et al. , 2007], One of the most 



important features of modern machine learning methodologies is the ability to 
leverage on sparsity in order to obtain scalability. Typically, learning methods 
that impose sparsity are based on the minimization of non-diffcrentiable objec- 
tive functionals. Is this the case of support vector machines or methods based 
on l\ regularization. 

In this chapter, we analyze optimization algorithms for a general class of reg- 
ularization fun ctionals, using sub -differential calculus and resolvents of mono- 
tone operators [Rockafellar . 1970. ?] to manage non-differentiabihty. In partic- 



ular, we study learning methods that can be interpreted as the minimization of 
a convex empirical risk term plus a squared n orm regularization into a repro- 
ducing kernel Hilbert space Aronszain . 1950l | Wk with non-null reproducing 
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kernel K, namely 



mm 

gen K 



f(g( Xl ),...,g( Xe )) + M3iK\ ) 



where / : M. e — > R+ is a finite-valued bounded below convex function. Regu- 
larization problems of the form (FT ]) admit a unique opt imal solution which, in 



CD admit a uniq i 
l [Scholkopf et all 



view of the representer theorem [Scholkopf et all l2001j , can be represented as 
a finite linear combination of kernel sections: 



We characterize optimal coefficients c, of the linear combination via a family 
of non-linear equations. Then, we introduce two general classes of optimiza- 
tion algorithms for large scale regularization methods that can be regarded as 
instances of non-linear Jacobi and Gauss-Seidel algorithms, and analyze their 
convergence properties. Finally, we state a theorem that shows how to reformu- 
late convex regularization problems, so as to trade off positive semidefiniteness 
of the kernel matrix for differentiability of the empirical risk. 



2 Solution characterization 

As a consequence of the representer theorem, an optimal solution of problem 
(CD can be obtained by solving finite-dimensional optimization problems of the 
form 

minF(c), F(c) = /(Kc) + ^ ) (2) 

where K £ Mf xe is a non-null symmetric positive semi-definite matrix called 
kernel matrix. The entries kij of the kernel matrix are given by 

where K : X x X ~> M. is a. positive semidefinite kernel function. It is easy to 
verify that the resulting kernel matrix is symmetric and positive semi-definite. 
Let ki (i — 1, ...,£) denote the columns of the kernel matrix. Particularly 
interesting is the case in which function / is additively separable. 

Definition 1 (Additively separable functional). A functional f : R. e — > R is 
called additively separable if 

i 

f(z) = J2f i (z i ). (3) 

i=l 

Parametric models with ii (ridge) regularization corresponds to the case in 
which inputs are n-dimensional numeric vectors {X = W 1 ) and the kernel matrix 
is chosen as K = XX T , where X 6 l' xn is a matrix whose rows are the input 
data Xi. Letting 

w := X T c, (4) 
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the following class of problems is obtained: 

mm (/CXti,) + M) . (5) 

Observe that one can optimize over the whole space R™, since the optimal 
weight vector will automatically be in the form Q . Parametric models with li 
regularization can be seen as specific instances of kernel methods in which K is 
the linear kernel: 

K(xi,x 2 ) = (xi,x 2 )2- 

In the following, two key mathematical objects will be used to characterize 
optimal solutions of problems ([2]) and ([5]). The first is the subdifferential df 
of the empirical risk. The second is the resolvent of the inverse subdifferential, 
defined as 

J a := (i + ap/r 1 ) -1 . (6) 

See the appendix for more details about these objects. The following result 
characterizes optimal solutions of problem ([2]) via a non-linear equation involv- 
ing J a . The characterization also holds for non-diffcrcntiable loss functions, and 
is obtained without introducing constrained optimization problems. The proof 
of Theorem [1] is given into the appendix. 

Theorem 1. For any a > 0, there exist optimal solutions of problem (0j such 
that 

c = — J a (aKc — c), (7) 

where J a is the resolvent of the inverse sub- differential (df) 1 , see f7j|]. 

The usefulness of condition J7]) depends on the possibility of computing closed- 
form expressions for the resolvent, which may not be feasible for general convex 
functionals. Remarkably, for many learning methods one can typically exploit 
the specific structure of / to work out closed-form expressions. For instance, 
when / is additively separable as in @, the sub-differential decouples with 
respect to the different components. In such a case, the computation of the 
resolvent reduces to the inversion of a function of a single variable, which can 
be often obtained in closed form. Indeed, in many supervised learning problems, 
additive separability holds, where fi(zi) = \~ 1 L(y.i, Zj), L : R x R — > R + is a loss 
function, and A > is a regularization parameter. TablefJlrcports the expression 
of the J a in correspondence with commonly used loss functions. When / is 
additively separable, the characterization ((?]) can be generalized as follows. 

Corollary 1. Assume that (0) holds. Then, for any on > 0, i = 1, . . . ,£, there 
exist optimal solutions of problem 0) such that 

Ci = -J l ai {aik[c- a), i = l,...,£, (8) 

where J^. are the resolvents of the inverse sub- differentials (dfi)~ , see (0|). 

In this paper, we analyze two iterative approaches to compute optimal so- 
lutions of problem ^ , based on the solution characterizations of Theorem [1] 
and Corollary[T] For both methods, we show that cluster points of the iteration 
sequence are optimal solutions, and we have 

minF(c) = lim F(c k ), 
3 



(9) 



Name 


Loss L(y 1 ,y 2 ) 


Operator —J a (v) 


Ll-SVM 
L2-SVM 
RLS 
RLA 
SVR 


(1 - 2/12/2)+ 
(1 ~ 2/12/2)+ 

(2/1 - 2/2)72 
I2/1 - 2/2 1 

(I2/1 — 2/2 1 -e)+ 


y min ((aA) -1 , (l-j/0 u) + ) 
V0(l-y©«) + /(l + aA) 
(y-w)/(l + aA) 
sign(y — w) min ((aA) 1 , |y — w|) 
sign(y -v)Q min ((aA)" 1 , (|y - v\ - e) + ) 



Table 1: Operator — J a for different methods. Some of the losses are expressed 
using the "positive part" function defined as (x)+ = max{0,ir}. In the right- 
most column, denotes the element-wise product, and functions are applied 
component- wise . 



where F denote the functional of problem © . Section [3] describes a first ap- 
proach, which involves simply iterating equation Q according to the fixed-point 
method. The method can be also regarded as a non-linear Jacobi algorithm to 
solve equation 0. It is shown that a can be always chosen so as to make the 
iterations approximate an optimal solution to arbitrary precision. In section 
21 we describe a second approach, that involves separately iterating the single 
components using the characterization of equation (|8|). For a suitable choice 
of on, the method boils down to coordinate descent, and optimality of cluster 
points holds whenever indices are picked according to an "essentially cyclical" 
rule. Equivalently, the method can be regarded as a non-linear Gauss-Seidel 
algorithm to solve ©. 

3 Fixed-point algorithms 

In this section, we suggest computing the optimal coefficient vector c of problem 
@ by simply iterating equation ([?])■ starting from any initial condition c°: 

c k+1 = -J a (aKc k -c k ). (10) 

Such procedure is the well-known fixed point iteration (also known as Picard or 
Richardson iteration) method. Provided that a is properly chosen, the proce- 
dure can be used to solve problem ^ to any given accuracy. Before analyzing 
the convergence properties of method (JTUJ), let's study the computational com- 
plexity of a single iteration. To this end, one can decompose the iteration into 
three intermediate steps: 

z k = Kc A , step 1 

v k = az k - c fc , step 2 

c k+1 = -J a (v k ). step 3 

The decomposition emphasize the separation between the role of the kernel 
(affecting only step 1) and the role of the function / (affecting only step 3). 

Step 1 

Step one is the only one that involves the kernel matrix. Generally, it is also the 
most computationally and memory demanding step. Since z = Kc represents 
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predictions on training inputs (or a quantity related to them) , it holds that being 
able to perform fast predictions also have a crucial impact on the training time. 
This is remarkable, since good prediction speed is a desirable goal on its own. 
Notice that an efficient implementation of the prediction step is beneficial for any 
learning method of the form ([2]), independently of /. Ideally, the computational 
cost of such matrix- vector multiplication is 0{£ 2 ). However, the kernel matrix 
might not fit into the memory, so that the time needed to compute the product 
might also include special computations or additional I/O operations. Observe 
that, if many components of vector c are null, only a subset of the rows of the 
kernel matrix is necessary in order to compute the product. Hence, methods that 
impose sparsity in vector c may produce a significant speed-up in the prediction 
step. As an additional remark, observe that the matrix-vector product is an 
operation that can be easily parallelized. 

In the linear case ([5]), the computation of z k can be divided in two parts: 

w k = X T c fc , 
z k = Xw k . 

In order to compute the product, it is not even necessary to form the kernel 
matrix, which may yields a significant memory saving. The two intermedi- 
ate products both need 0(n£) operations and the overall cost still scales with 
0(n£). When the number of features is much lower than the number of exam- 
ples (n <C I), there's a significant improvement with respect to 0(i 2 ). Speed-up 
and memory saving are even more dramatic when X is sparse. In such a case, 
computing the product in two steps might be more convenient also when n > £. 

Step 2 

Step two is a simple subtraction between vectors, whose computational cost is 
0(£). In section [SJ it is shown that v = aKc-c can be interpreted as the vector 
of predictions on the training inputs associated with another learning problem 
consisting in stabilizing a functional regularized whose empirical risk is always 
differentiable, and whose kernel is not necessarily positive. 

Step 3 

Step three is the only one that depends on function /. Hence, different algo- 
rithms can be implemented by simply choosing different resolvents J a . Table [5] 
reports the loss function L and the corresponding resolvent for some common 
supervised learning methods. Some examples are given below. Consider prob- 
lem with the "hinge" loss function L (2/1,2/2) = (1 — J/iJ/2) + , associated with 
the popular Support Vector Machine (SVM). For SVM, step three reads 

c fe+1 = 2/ min (J^-, (l-j/0 v k ) , 

where denotes the element-wise product, and min is applied element-wise. As 
a second example, consider classic regularized least squares (RLS). In this case, 
step three reduces to 

c k+i = v~v k 

l + a\' 
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Generally, the complexity of step three is 0(£) for any of the classical loss 
functions. 



3.1 Convergence 

The following result states that the sequence generated by the iterative pro- 
cedure (|10j) can be used to approximately solve problem @ to any precision, 
provided that a is suitably chosen. 

Theorem 2. If the sequence c k is generated according to algorithm U0\) . and 

0<a< Wh> 

then (0) holds. Moreover, c k is bounded, and any cluster point is a solution of 
problem @). 

A stronger convergence result holds when the kernel matrix is strictly pos- 
itive or / is differentiable with Lipschitz continuous gradient. Under these 
conditions, it turns out that the whole sequence c k converges at least linearly 
to an unique fixed point. 

Theorem 3. Suppose that the sequence c k is generated according to algorithm 
U0\) , where a satisfy ill)) , and one of the following conditions holds: 

1. The kernel matrix K is positive definite. 

2. Function f is everywhere differentiable and V/ is Lipschitz continuous, 

Then, there exists a unique solution c* of equation and c k converges to c* 
with the following rate 

\\c k+1 -c*\\ 2 < v\\c k -c*\\ 2 , 0< M <1. 

In practice, condition (TTT1) can be equivalently satisfied by fixing a = 1 and 
scaling the kernel matrix to have spectral norm between and 2. In problems 
that involve a regularization parameter, this last choice will only affect its scale. 
A possible practical rule to choose the value ofaisa = l/||K||2, which is equiv- 
alent to scale the kernel matrix to have spectral norm equal to one. However, 
in order to compute the scaling factor in this way, one generally needs all the 
entries of the kernel matrix. A cheaper alternative that uses only the diagonal 
entries of the kernel matrix is a = f/tr(K), which is equivalent to fix a to one 
and normalizing the kernel matrix to have trace one. To see that this last rule 
satisfy condition (TTT|) , observe that the trace of a positive semidefinite matrix 
is an upper bound for the spectral norm. In the linear case ([5]), one can di- 
rectly compute a on the basis of the data matrix X. In particular, we have 
||K||2 = ||X||2, and tr(K) — ||X||^, where || ■ \\p denotes the Frobenius norm. 



4 Coordinate-wise iterative algorithms 

In this section, we describe a second optimization approach that can be seen as 
a way to iteratively enforce optimality condition ([8]). Throughout the section, 
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it is assumed that / is additively separable as in ([3]). In view of Corollary [TJ the 
optimality condition can be rewritten for a single component as in (J5J . Consider 
the following general update algorithm: 

cp l = -J t at {a i l$<*- < $), i = l,...J. (12) 

A serial implementation of algorithm (|10j) can be obtained by choosing on — a 
and by cyclically computing the new components c k+1 according to equation 
(fT2"]) . Observe that this approach requires to keep in memory both c k and c fc+1 
at a certain time. In the next sub-section, we analyze a different choice of 
parameters on that leads to a class of coordinate descent algorithms, based on 
the principle of using new computed information as soon as it is available. 

4.1 Coordinate descent methods 



Algorithm 1 Coordinate descent for regularized kernel methods 
while maxi \hi\ > S do 

Pick a coordinate index i according to some rule, 

z i i ) 

«? = Z i/ k U ~ 4, 

tmp = Si(v$), 
hi = tmp — , 
c k+1 = tmp, 
end while 



A coordinate descent algorithm updates a single variable at each iteration 
by solving a sub-problem of dimension one. During the last years, optimiza- 
tion via coordinate descent is becoming a popular approach in machine learn- 



vorable computational properties 


Friedman et al., 20071 Tseng and Yun 


2008. 


Wu and LaneeLl2008LIChane: et al.. 


2008. Hsieh et al. 2008. Yun and Toh 


2009, 


Huane et al.. 2010. Friedman et al 


., 20101. Although the method may require 



many iterations to converge, the specific structure of supervised learning ob- 
jective functionals allows to solve the sub-problems with high efficiency. This 
makes the approach competitive especially for large-scale problems, in which 
memory limitations hinder the use of second order optimization algorithms. As 
a matter of fact, state of th e art s olvers for large scale supervised learning such 
as glmnet 



IFan et al 



Friedman et al. . 2010| for generalized linear models, or LIBLINEAR 
2008| for SVMs are based on coordinate descent techniques. 



The update for c k in algorithm (TT21 also depends on components c k with 
j < i which have already been updated. Hence, one needs to keep in memory 
coefficients from two subsequent iterations c k+1 and c . In this sub-section, we 
describe a method that allows to take advantage of the computed information 
as soon as it is available, by overwriting the coefficients with the new values. 
Assume that the diagonal elements of the kernel matrix are strictly positive, 
i.e. ku > 0. Notice that this last assumption can be made without any loss 
of generality. Indeed, if ka — for some index i then, in view of the inequal- 
k; 



ity 



< 



Wkiikjj, it follows that 



k- ■ — 



for all j. Hence, the whole i-th 



column (row) of the kernel matrix is zero, and can be removed without affect- 
ing optimization results for the other coefficients. By letting a, = l/ka and 
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Si := — Jj fe . s_! in equation (|SJ|, the i-th coefficient in the inner sum does cancel 
out, and we obtain 




= * V^c, . (13) 



The optimal z-th coefficient is thus expressed as a function of the others. Sim ilar 
characterizations have been also derived in [Dinuzzo and De" Nicolaol [20091 ] for 



several loss functions. Equation (1131) is the starting point to obtain a variety 
of coordinate descent algorithms involving the iterative choice of a a coordinate 
index i followed by the optimization of Cj as a function of the other coefficients. 
A simple test on the residual of equation ([TBI can be used as a stopping condi- 
tion. The approach can be als o regarded as a non-linear Gauss-Seidel method 
Ortega and Rheinboldt . 2000| | for solving the equations ©. It is assumed that 
vector c is initialized to some initial c°, and coefficients hi are initialized to the 
residuals of equation (|13[) evaluated in correspondence with c . Remarkably, in 
order to implement the method for different loss functions, we simply need to 
modify the expression of functions Si. Each update only involves a single row 
(column) of the kernel matrix. In the following, we will assume that indices are 
recursively p i cked according to a rule t hat satisfy the following condition, see 



Tsend . l200ll iLuenbereer and Yd . [200 



Essentially Cyclic Rule. There exists a constant integer T > £ such that 
every index i € {1, . . . ,£} is chosen at least once between the fc-th iteration and 
the (k + T- l)-th, for all k. 

Iterations of coordinate descent algorithms that use an essentially cyclic rule 
can be grouped in macro-iterations, containing at most T updates of the form 
(fTB"]) . within which all the indices are picked at least once. Below, we report 
some simple rules that satisfy the essentially cyclic condition and don't require 
to maintain any additional information (such as the gradient): 

1. Cyclic rule: In each macro-iteration, each index is picked exactly once 
in the order l,...,t. Hence, each macro- iteration consists exactly of £ 
iterations. 

2. Aitken double sweep rule: Consists in alternating macro-iterations in 
which indices are chosen in the natural order 1, . . . , £ with macro- iterations 
in the reverse order, i.e. (£ — 1), . . . , 1. 

3. Randomized cyclic rule: The same as the cyclic rule, except that in- 
dices are randomly permuted at each macro-iteration. 

In the linear case ([5]), z\ can be computed as follows 

w k =Xc fc , 
4' = xfw k . 

By exploiting the fact that only one component of vector c changes from an 
iteration to the next, the first equation can be further developed: 

w k = X T c fe = + (X T e p )h p = w^ 1 + Xphp 
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where p denotes the index chosen in the previous iteration, and h p denotes the 
variation of coefficient c p in the previous iteration. By introducing these new 
quantities, the coordinate descent algorithm can be rewritten as in Algorithm 
where we have set 5, := — Jl M _ a . 



Algorithm 2 Coordinate descent (linear kernel) 
while maxi \hi\ > S do 

Pick a coordinate index i according to some rule, 
if hp 7^ then 

w k — w k ^ 1 + Xphp, 
end if 

b T h 

z* = x\ w K , 

«? = «?/imi§ - °h 

tmp = Si(v£), 
hi = tmp — c\ , 
c k+1 = tmp, 
p = i 
end while 



The computational cost of a single iteration depends mainly on the updates 
for w and Zj, and scales linearly with the number of features, i.e. 0(n). When 
the loss function have linear traits, it is often the case that coefficient q doesn't 
change after the update, so that hi — 0. When this happen, the next update of 
to can be skipped, obtaining a significant speed-up. Further, if the vectors x» 
are sparse, the average computational cost of the second line may be much lower 



than 0(n). A technique of this kind has be en proposed in I Hsieh et al.l . 1200 



and implemented in the package LIBLINEAR [Fan et all l2008j to improve speed 



of coordinate descent iterations for linear SVM training. Here, one can see that 
the same technique can be applied to any convex loss function, provided that 
an expression for the corresponding resolvent is available. 

The main convergence result for coordinate descent is stated below. It should 
be observed that the classical theory of convergence for coordinate descent is 
typically formulated for differentiable objective functionals. When the objec- 
tive functional is not differentiable, there exist count erexamples show ing that 
the method may get stuck in a non-stationary point Auslender . 1976| . In the 



non-differentiable case, o ptimality of cl uster points of coordinate descent itera- 
tions has been proven in TsensJ . l200lj (see also references therein) , under the 



additional assumption that the nqn-diffe rentiable part is additively separable. 
Unfortunately, the result of Tsenel 200 lj cannot be directly applied to problem 



([2]), since the (possibly) non-differential part /(Kc) is not separable with respect 
to the optimization variables Cj, even when Q holds. Notice also that, when 
the kernel matrix is not strictly positive, level sets of the objective functional 
are unbounded (see Lemma[T]in the appendix). Despite these facts, it still holds 
that cluster points of coordinate descent iterations are optimal, as stated by the 
next Theorem. 

Theorem 4. Suppose that the following conditions hold: 

1. Function f is additively separable as in |3J), 

2. The diagonal entries of the kernel matrix satisfy kn > 0, 
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3. The sequence c k is generated by the coordinate descent algorithm (Algo- 
rithm^ or\^\) , where indices are recursively selected according to an essen- 
tially cyclic rule. 

Then, holds, c k is bounded, and any cluster point is a solution of problem 

5 A reformulation theorem 

The following result shows that solutions of problem © satisfying equation @ 
are also stationary points of a suitable family of differentiablc functionals. 

Theorem 5. If c satisfy fTp, then it is also a stationary point of the following 
functional: 

F a (c)=a- 1 f a (K a c) + ^^, 

where f a denotes the Moreau-Yosida regularization of f, and K Q := aK — I. 

Theorem [5] gives an insight into the role of parameter a, as well as providing 
an interesting link with machine learning with indefinite kernels. By the prop- 
erties of the Moreau-Yosida regularization, f a is differentiable with Lipschitz 
continuous gradient. It follows that F a also have such property. Notice that 
lower values of a are associated with smoother functions f a , while the gradi- 
ent of a~ 1 f a is non-expansive. A lower value of a also implies a "less positive 
semidefinite" kernel, since the eigenvalues of K Q are given by (atoti — 1), where cti 
denote the eigenvalues of K. Indeed, the kernel becomes non-positive as soon 
as amhii{ai} < 1. Hence, the relaxation parameter a regulates a trade-off 
between smoothness of f a and positivity of the kernel. 

When / is additively separable as in ([3]), it follows that f a is also additively 
separable: 

£ 

fa{z) = ^2fia(Zi), 
i=l 

and fi a is the Moreau-Yosida regularization of /j. The components can be often 
computed in closed form, so that an "equivalent differentiable loss function" can 
be derived for non-differentiable problems. For instance, when /j is given by 
the hinge loss fiizi) = (1 — yiZi) + , letting a — 1, we obtain 

f / x _ J 1/2 - ViZi, yizi < 
teW-X (l-tfi^/2, yi Zi>0 

Observe that this last function is differentiable with Lipschitz continuous deriva- 
tive. By Theorem O it follows that the SVM solution can be equivalently com- 
puted by searching the stationary points of a new regularization functional ob- 
tained by replacing the hinge loss with its equivalent differentiable loss function, 
and modifying the kernel matrix by subtracting the identity. 

6 Conclusions 

In this paper, fixed-point and coordinate descent algorithms for regularized ker- 
nel methods with convex empirical risk and squared RKHS norm regularization 
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have been analyzed. The two approaches can be regarded as instances of non- 
linear Jacobi and Gauss-Seidel algorithms to solve a suitable non-linear equation 
that characterizes optimal solutions. While the fixed-point algorithm has the 
advantage of being parallelizable, the coordinate descent algorithm is able to 
immediately exploit the information computed during the update of a single 
coefficient. Both classes of algorithms have the potential to scale well with the 
dataset size. Finally, it has been shown that minimizers of convex regularization 
functionals are also stationary points of a family of diffcrcntiable regularization 
functionals involving the Moreau-Yosida regularization of the empirical risk. 

Appendix A 

In this section, we review some concepts and theorems from analysis and linear 
algebra, which are used in the proofs. Let E denote an Euclidean space endowed 
with the standard inner product (x\, x 2 ) 2 = xfx 2 and the induced norm ||.t|| 2 = 
y/{x,x) 2 . 

Set-valued maps 

A set-valued map (or multifunction) A : E — >• 2 E is a rule that associate to each 
point x G E a subset A(x) C E. Notice that any map A : E — > E can be seen as 
a specific instance of multifunction such that A(x) is a singleton for all x G E. 
The multi-function A is called monotone whenever 

(yi ~ 2/2, x\ - 22)2 > 0, Vxi,x 2 GE, yi&A(xi), y 2 e A(x 2 ), 
If there exists L > such that 

hi - 2/2 1| 2 < L\\xi - X2W2, Vxi,x 2 eE, jieifn), y 2 e A(x 2 ), 

then A is single- valued, and is called Lipschitz continuous function with modulus 
L. A Lipschitz continuous function is called nonexpansive if L = 1, contractive 
if L < 1, and firmly non-expansive if 

\\yi - J/2III < (yi -yi,x\ -x 2 ) 2 , Vxi,x 2 gE, yi€A(xi), y 2 ^A(x 2 ). 

In particular, firmly non-expansive maps are single- valued, monotone, and non- 
expansive. For any monotone multifunction A, its resolvent j£ is defined for 
any a > as ■= (1 + aA) , where I stands for the identity operator. 
Resolvents of monotone operators are known to be firmly non-expansive. 

Finite-valued convex functions 

A function / : E — > K is called finite-valued convex if, for any a e [0, 1] and any 
x\,x 2 <E E, it satisfy 

-00 < f(ax\ + (1 - ct)x 2 ) < af(xi) + (1 - a)f(x 2 ) < +00 

The subdiffcrential of a finite-valued convex function / is a multifunction df : 
E ->■ 2 E defined as 

df(x) = U G E : f(y) - f(x) > {£,y - x) 2 , Vy G E} . 

It can be shown that the following properties hold: 
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1. df(x) is a non-empty convex compact set for any iEE, 

2. / is (Gateaux) diffcrcntiable at x if and only if df(x) = {V/(x)} is a 
singleton (whose unique element is the gradient). 

3. df is a monotone multifunction. 

4. The point x* is a (global) minimizer of / if and only if € df(x*). 

For any finite- valued convex function /, its Moreau-Yosida regularization (or 
Moreau envelope, or quadratic min-convolution) is defined as 



/ ot 
f a {x) := mm ( f(y) + -\\ 

yeE \ Z 



y-x\\l 



For any fixed x, the minimum in the definition of f a is attained at y = p a (x), 
where p a := (1 + a _1 9/) denotes the so-called proximal mapping. It can be 
shown that the following remarkable properties hold: 

1. f a is convex differentiable, and the gradient V/ Q is Lipschitz continuous 
with modulus 1/a. 

2- f a (x) = f(p a (x)) + %\\p a (x)-x\\i 

3. f a and / have the same set of minimizers for all a. 

4. The gradient Vf a is called Moreau-Yosida regularization of df, and satisfy 

V/ a (af) = a{x -p a {x)) = aJ a (x), 
where J a denote the resolvent of the inverse sub-differential defined as 

J a := (l + aidf)- 1 )'' 



Convergence theorems 

Theorem 6 (Contraction mapping theorem). Let A : E — > E and suppose that, 
given c° , the sequence c k is generated as 

c k+1 = A(c k ). 

If A is contractive with modulus fi, then there exists a unique fixed-point c* such 
that c* = A(c*), and the sequence c k converges to c* at linear rate: 

\\c k+1 ~ c*\\ 2 < ^\\c k - c*\\ 2 , 0< M < 1. 



The following result is know as Zangwill's conve rgence theorem Zangwill . 1969j . 
see also page 206 of [Luenbereer and Yell2008lj . 

Theorem 7 (Zangwill's convergence theorem). Let A : E — > 2 E denote a mul- 
tifunction, and suppose that, given c° , the sequence c k is generated as 

c k+i g A ( c ky 

Let r C E called solution set. // the following conditions hold: 
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1. The graph Ga — {(x,y) £ExE:j/£ A(x)} is a closed set, 

2. There exists a descent function F such that 

• For all ier, F(A(x)) < F(x), 

• For all x£T, F(A(x)) < F(x), 

3. The sequence c k is bounded, 

then all the cluster points of c k belongs to the solution set. 

Appendix B 

The following Lemma will prove useful in the subsequent proofs. 

Lemma 1. The functional F of problem ^) is such that F(c + u) = F{c), for 
any vector u in the nullspace of the kernel matrix. 

Proof. Let u denote any vector in the nullspace of the kernel matrix. Then, we 
have 

F(c + u) = f (K(e + u)) + ^ L_J L = f ( Kc ) + — = F(c). 

□ 

Proof of Theorem QJ Problem ([2]> is a convex optimization problem, where the 
functional F is continuous and bounded below. First of all, we show that there 
exists optimal solution. Observe that minimization can be restricted to the 
range of the kernel matrix. Indeed, any vector c S E can be uniquely decom- 
posed as c = u + v, where u belongs to the nullspace of K and v belongs to the 
range. By Lemma [1] we have F(c) = F(v). Since F is coercive on the range of 
the kernel matrix (limi| t ,|| 2 _ ! . +00 F(v) = +oo), it follows that there exist optimal 
solutions. 

A necessary and sufficient condition for c* to be optimal is 

e dF{c*) = K (df (Kc*) + c*) = KG(c*), G{c*) := df (Kc*) + c. 

Consider the decomposition G(c*) = uq + vq, where uq belongs to the nullspace 
of the kernel matrix and vq belongs to the range. Observe that 

v G = G(c*) -u G = G(c* - u G ). 

We have 

e KG(c*) = Kv G 6 G(c* - u G ) = v G , 

so that, for any optimal c*, there exists an optimal c = c* — u G such that 

0e<9/(Kc)+c. (14) 

By introducing the inverse sub-differential, equation (1141) can be rewritten as 

Kc G (dfy 1 (-c). 
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Multiplying by a > both sides and subtracting c, we obtain 

aKc — c G a (df)^ 1 (— c) — c. 

Finally, introducing the resolvent J a as in @, we have 

aKc — c G ( J Q ) 1 (— c) 

Since J Q is single-valued, equation (0 follows. □ 

Proof of Corollary [H Let's start from the sufficient condition for optimality 
(H3J). If ([3]) holds, then the subdiffcrential of / decouples with respect to the 
different components, so that there exist optimal coefficients c 2 ; such that 

Oedfi(kfc)+Ci, i = l, ...,£. 

Equivalently, 

kjc g (Of,)- 1 (-a). 

Multiplying by 014 > both sides and subtracting Ci, we have 

cukj c - a G aj (dfiy 1 {-Ci) - Ci. 
The thesis follows by introducing the resolvents J^. and solving for — a. □ 



Proof of Theorem We show that the sequence c k generated by algorithm (fT0|) 
converges to an optimal solution of Problem ((2]). By Theorem [1] there exists 
optimal solutions c* satisfying ([7]). We now observe that any other vector c such 
that K(c* — c) = is also optimal. Indeed, we have c = c*+u, where u belongs to 
the nullspace of the kernel matrix. By Lemma[TJ it follows that F(c) = F(c*). 
To prove ©, it suffices to show that Kr fc — s> 0, where r k := c k — c* can be 
uniquely decomposed as 

r k = u k + v k , Ku k = 0, (u k ,v k ) 2 = 0. 

We need to prove that \\v k \\2 —> 0. Since J a is nonexpansive, we have 



l k+1 := l|r fc+1 ||I = ||c fe+1 -c*|| 



2 

2 



= || J a {aKc k - c k ) - J a (aKc* - c ;N2 
< \\aKr k -r k \\ 2 2 
= \\aKv k -r k f 2 . 

Observing that v k is orthogonal to the nullspace of the kernel matrix, we can 
further estimate as follows 



\\aKv k - r k \\ 2 2 = 7 fc - v kT (2aK - a 2 K 2 ) v 3 < 7 fe - P\\v k \\ 2 2 , 

where 

fj := min ucti(2 — acti). 

i:ai>0 

and on denote the eigenvalues of the kernel matrix. Since the kernel matrix is 
positive semidefinite and condition (fTT|) holds, we have 

< aou < 2. 
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Since the kernel matrix is not null and have a finite number of eigenvalues, 
there's at least one eigenvalue with strictly positive distance from zero. It follows 
that P > 0. Since 

< 7 fe+1 < 7° ~/3£ 

3=1 

we have, necessarily, that ||u||2 — > 0. Finally, observe that c k remains bounded 

\\c k \\ 2 <\\r k \\ 2 + \\c*\\ 2 <\\r% + \\c*\\ 2 , 

so that there's a subsequence converging to an optimal solution. In fact, by © 
it follows that any cluster point of c k is an optimal solution. □ 

Proof of Theorem [3J Algorithm (1101) can be rewritten as 

c fe+i = A ^ 

where the map A : E —> E is defined as 

A(c) := -J a (aKc - c) . 

Under both conditions (1) and (2) of the theorem, we show that A is contractive. 
Uniqueness of the fixed-point, and convergence with linear rate will then follow 
from the contraction mapping theorem (Theorem [6]). Let 

/ii := \\oiK — X 1 1 2 — max |1 — aja|, 

i 

where at denote the eigenvalues of the kernel matrix. Since the kernel matrix 
is positive semidcfinitc, and condition (|11[) holds, we have 

< diet < 2, 

so that fi\ < 1. We now show that the following inequality holds: 

\\J a {Vl)-Jafo)h<l*i\\Vl-V2h, (15) 

where 

1 \ -V 2 



^ ■= y 1 1 V 

and L denotes the Lipschitz modulus of V/ when / is difierentiable with Lip- 
schitz continuous gradient, and L = +oo otherwise. Since J a is nonexpansive, 
it is easy to see that (fT5|) holds when L = +oo. Suppose now that / is difier- 
entiable and V/ is Lipschitz continuous with modulus L. It follows that the 
inverse gradient satisfies 

ikv/tV) - (v/r^ib > ^\\ Xl - X2 \\ 2 . 

Since (V/) -1 is monotone, we have 

II^Ozi) " J«\x2)\\l = \\xx -x 2 + (V/)- 1 ^) - (Vf)-\x 2 )\\l 

>\\xi~x 2 \\l + \\{yf)-\ Xl )-{vf)-\x 2 )\\l 
- + \\ Xl ~ x ^- 
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From this last inequality, we obtain (fT5|) . Finally, we have 



\\A(ci) - A(c 2 )\\ 2 = \\J Q (aKci - c x ) - J a («Kc 2 - c 2 ) || 2 
</X2||(aK-I)(ci-pa)||2 
< - c 2 || 2 , 

where we have set fi := /ii/i2- Consider the case in which K is strictly positive 
definite. Then, it holds that 

< ana < 2, 

so that fix < 1, and A is contractive. Finally, when / is differentiable and 
V/ is Lipschitz continuous, we have fi 2 < 1 and, again, it follows that A is 
contractive. By the contraction mapping theorem (Theorem ^ , there exists a 
unique c* satisfying ([7]), and the sequence c k of Picard iterations converges to 
c* at a linear rate. □ 

Proof of Theorem [^J We shall apply Theorem[7]to the coordinate descent macro- 
iterations, where the solution set Y is given by 

T := {c 6 E : © holds} . 

Let A denote the algorithmic map obtained after each macro-iteration of the 
coordinate descent algorithm. By the essentially cyclic rule, we have 

ceA(c)= |J {(^c-o^JCc)}, 
(ii,...,i,)£/ 

where I is the set of strings of length at most s = Ton the alphabet {1, . . . ,£} 
such that all the characters are picked at least once. Observing that the set 
I has finite cardinality, it follows that the graph Ga is the union of a finite 
number of graphs of point-to-point maps: 

G A = (J {(x,y) €ExE:y=(A il o.-.oA is ){x)}. 

(ii,...,j,)ei 

Now notice that each map Ai is of the form 

Aj(c) = c + eiU(c), ti(c) := Si ly^^Cj] - c*. 

\i# 11 ) 

All the resolvents are Lipschitz continuous, so that functions Ai are also Lip- 
schitz continuous. It follows that the composition of a finite number of such 
maps is continuous, and its graph is a closed set. Since the union of a finite 
number of closed sets is also closed, we obtain that Ga is closed. 

Each map Ai yields the solution of an exact line search over the i-th coordi- 
nate direction for minimizing functional F of Problem ©. Hence, the function 

<t>i{t) = F(c + eit), 

is minimized at U(c), that is 

g dUU(c)) = (e t , dF(c + e i t i (c))) 2 = (k h 3/(KA t (c)) + A t (c)) 2 . 
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Equivalently, 

- {k l ,A l {c)) 2 ^(k u df{KA l (c))) 2 . (16) 
By definition of subdifferential, we have 

f(KAi{c)) - f(Kc) < ti(c) 7 , V 7 G (fc l ,a/(KA J (c))) 2 . 

In particular, in view of f| X6|) . we have 

f(KA t (c)) - /(Kc) < ~t l (c)(k l ,A l (c)) 2 . 

Now, observe that 

F(A{c)) < F(Ai(c)) = F(c + eMc)) 

= F(c) + t\{c) h -f + ti(c)(ki,c) 2 + f(KA t (c)) - /(Kc) 

< F(c) +tj(c) , Y + t t (c)(h,c)2 -ti(c)(ki, A l (c)) 2 

= F(c)+t 2 i (c) l f+t i (c)(k i ,c-A i (c)) 2 

= F{c)+tKc) k -f-tj{c)k vl 

= F(c)-tUc)'f. 

Since ku > 0, the following inequalities hold: 

f?(c) < ^ (F(c) - F(Mc))) < A (F(c) - F(A(c))) . (17) 

We now show that F is a descent junction for the map A associated with the 
solution set T. Indeed, if c satisfy ([8]), then the application of the map A doesn't 
change the position, so that 

F(A(c)) = F(c). 

On the other hand, if c does not satisfy (JSJ), there's at least one index i such 
that U(c) ^ 0. Since all the components are chosen at least once, and in view 
of (fT7|) . we have 

F(A(c)) < F(c). 

Finally, we need to prove that the sequence of macro-iterations remains bounded. 
In fact, it turns out that the whole sequence c k of iterations of the coordinate 
descent algorithm is bounded. From the first inequality in (|17p . the sequence 
F(c k ) is non- increasing and bounded below, and thus it must converge to a 
number 

Foe- lim F(c k )<F(c°). (18) 

k— >+oo 

Again from (|17[) . we obtain that the sequence of step sizes is square summable: 

V ||c fc+1 - c fe || < — (F(c°) - Foo) < +oo. 
' — ' z mm,- fc,-,- 
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In particular, step-sizes are also uniformly bounded: 



c K 2 <—— 

" z mm, kj 



(F(c°) - Foo) < +00. 



(19) 



Now, fix any coordinate i, and consider the sequence c k . Let hij denote the 
subsequence of indices in which the i-th component is picked by the essentially 
cyclic rule and observe that 



kjc h *i- x 
kii 



ha-l 



Recalling the definition of Si, and after some algebra, the last equation can be 
rewritten as 

e-dhikfc^ + kuti (c^- 1 )). 

Since dfi{x) is a compact set for any x € K, it suffices to show that the argument 
of the subdiffcrcntial is bounded. For any k, let's decompose c fc as 



u k + v k , 



Ku k = 0, 



(u k ,v k ) 2 = Q. 



Letting ct\ > denote the smallest non-null eigenvalue of the kernel matrix, we 
have 

ai||« fc |||j < v kT Kv k = c kT Kc k < 2F(c k ) < 2F(c°). 



By the triangular inequality, we have 



kjc k +kuU (c k )\<M 



kjc k 



+ U(c k ) 



< M 



kTc k 



+ \U (c k ) 



where M := max^ \kjj\. The first term can be majorized as follows: 
kjc k 



kjv k 



< 



v h < 



/2F(c°) 



< 



l2£F((P) 



< +00, 



a 1 



while the term tj (c fe ) I is bounded in view of (|T9^) . It follows that c k is bounded 
independently of i, which implies that c k is bounded. In particular, the subse- 
quence consisting of the macro-iterations is bounded as well. 

By Theorem [7| there's at least one subsequence of the sequence of macro- 
iterations converging to a limit that satisfies ([5]), and thus minimizes F. By 
continuity of F, we have 

-F(coo) = minF(c). 

cGR f 

Finally, in view of (|T8|) . we have F^ — F(coo), which proves (|9]) and shows that 
any cluster point of c k is an optimal solution of Problem @. □ 

Proof of Theorem [5l Equation ([7]) can be rewritten as 

J a (K a c) + c = 0. 

Now, let f a denote the Moreau-Yosida regularization of /. From the properties 
of f a , we have 

yf a (K a c) + ac = 0. 
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Multiplying both sides of the previous equation by a 1 K Q , we obtain 

a - 1 K ct V.f Q (K Q c) + K Q c = 0. 
Finally, the last equation can be rewritten as 

c T K a c 



a' 1 f a (K a c) + 



o. 



so that the thesis follows. □ 
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